Stream selection and integration in multistream ASR using GMM-based performance monitoring

Authors

  • Tetsuji Ogawa
  • Feipeng Li
  • Hynek Hermansky
Abstract

A moderately deep and rather wide artificial neural net is applied to phoneme recognition of noisy speech. The net is formed by first estimating posterior probabilities of phonemes in 21 band-limited streams that together cover the whole speech spectrum. These 21 streams are then divided into three subsets of seven streams each by sub-sampling the original 21 streams in different ways. In the second processing stage, all non-empty combinations of the seven streams in each subset (2^7 - 1 = 127 combinations per subset) are formed as inputs to 127 artificial neural nets that are again trained to yield phoneme posteriors. In this way, 127 × 3 = 381 processing streams are formed. A novel technique for finding the best combination of the resulting 381 parallel processing streams, which uses the likelihood of a single-state Gaussian mixture model of the final classifier output, is applied to select the most efficient streams. The technique is effective in phoneme recognition of speech corrupted by realistic additive noise.
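The stream counts in the abstract can be checked with a short sketch. The abstract does not say how the 21 streams are sub-sampled, so the interleaved split below (every third band, with a different starting offset) is only an assumption for illustration, and the function names are hypothetical rather than taken from the paper.

```python
from itertools import combinations

NUM_BANDS = 21    # band-limited streams covering the speech spectrum
NUM_SUBSETS = 3   # number of sub-sampled subsets


def subsample_bands(num_bands=NUM_BANDS, num_subsets=NUM_SUBSETS):
    """Split band indices into interleaved subsets.

    ASSUMPTION: interleaving (offset, offset + 3, ...) is one plausible
    sub-sampling; the abstract does not specify the actual scheme.
    """
    return [list(range(offset, num_bands, num_subsets))
            for offset in range(num_subsets)]


def nonempty_combinations(bands):
    """Return every non-empty combination of the given band indices."""
    return [combo
            for size in range(1, len(bands) + 1)
            for combo in combinations(bands, size)]


subsets = subsample_bands()

# One second-stage processing stream per (subset, combination) pair.
streams = [(subset_id, combo)
           for subset_id, bands in enumerate(subsets)
           for combo in nonempty_combinations(bands)]

per_subset = len(nonempty_combinations(subsets[0]))
print(len(subsets[0]), per_subset, len(streams))  # prints: 7 127 381
```

Each subset holds seven bands, so it yields 2^7 - 1 = 127 non-empty combinations, and three subsets give the 381 second-stage streams among which the GMM-based monitor selects.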

Similar resources

A Framework for Practical Multistream ASR

Robustness of automatic speech recognition (ASR) to acoustic mismatches can be improved by using a multistream architecture. Past multistream approaches involve training a large number of neural networks, one for each possible stream combination. During the testing phase, each utterance is forward passed through all the neural networks to estimate the best stream combination. In this work, we propose a new...

New entropy based combination rules in HMM/ANN multi-stream ASR

Classifier performance is often enhanced by combining multiple streams of information. In the context of multistream HMM/ANN systems in ASR, a confidence measure widely used in classifier combination is the entropy of the posterior distribution output by each ANN, which generally increases as classification becomes less reliable. The rule most commonly used is to select the ANN with the...

Acoustic model selection for recognition of regional accented speech

Accent is cited as an issue for speech recognition systems [1]. Research has shown that accent mismatch between the training and the test data will result in significant accuracy reduction in Automatic Speech Recognition (ASR) systems. Using HMM based ASR trained on a standard English accent, our study shows that the error rates can be up to seven times higher for accented speech, than for stan...

Integration of language identification into a recognition system for spoken conversations containing code-Switches

This paper describes the integration of language identification (LID) into a multilingual automatic speech recognition (ASR) system for spoken conversations containing code-switches between Mandarin and English. We apply a multistream approach to combine at frame level the acoustic model score and the language information, where the latter is provided by an LID component. Furthermore, we advanc...

Speaker adaptation for audio-visual speech recognition

In this paper, speaker adaptation is investigated for audiovisual automatic speech recognition (ASR) using the multistream hidden Markov model (HMM). First, audio-only and visual-only HMM parameters are adapted by combining maximum a posteriori and maximum likelihood linear regression adaptation. Subsequently, the audio-visual HMM stream exponents are adapted to better capture the reliability o...

Publication date: 2013